Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.simplefilter("ignore")

Import the Dataset

In [2]:
mydata = pd.read_csv("vehicle-1.csv")
mydata_copy = mydata.copy()

Read the dataset

In [3]:
mydata.head()
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus

Data pre-processing – Perform all the necessary preprocessing on the data ready to be fed to an Unsupervised algorithm

Missing Values

In [4]:
mydata.isnull().sum()
Out[4]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64

Removal of missing values in dataset

In [5]:
cols = ['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio']
for feature in cols:
    mydata[feature] = mydata[feature].fillna(mydata[feature].median())

Confirmation of missing values

In [6]:
mydata.isnull().sum()
Out[6]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64

Check Outliers

There are no outliers for Compactness
There are no outliers for Circularity
There are no outliers for distance_circularity
There are no outliers for scatter_ratio
There are no outliers for elongatedness
There are no outliers for pr.axis_rectangularity
There are no outliers for max.length_rectangularity
There are no outliers for scaled_radius_of_gyration
There are no outliers for skewness_about.2
There are no outliers for hollows_ratio

There are outliers for radius_ratio
There are outliers for pr.axis_aspect_ratio
There are outliers for max.length_aspect_ratio
There are outliers for scaled_variance
There are outliers for scaled_variance.1
There are outliers for scaled_radius_of_gyration.1
There are outliers for skewness_about
There are outliers for skewness_about.1

In [7]:
for feature in cols:
    plt.figure(figsize = (6,6))
    mydata.boxplot([feature])

Removal of Outliers with zscore

In [8]:
from scipy import stats
mydata['zradius_ratio'] = np.abs(stats.zscore(mydata['radius_ratio']))
mydata['zpr.axis_aspect_ratio'] = np.abs(stats.zscore(mydata['pr.axis_aspect_ratio']))
mydata['zmax.length_aspect_ratio'] = np.abs(stats.zscore(mydata['max.length_aspect_ratio']))
mydata['zscaled_variance'] = np.abs(stats.zscore(mydata['scaled_variance']))
mydata['zscaled_variance.1'] = np.abs(stats.zscore(mydata['scaled_variance.1']))
mydata['zscaled_radius_of_gyration.1'] = np.abs(stats.zscore(mydata['scaled_radius_of_gyration.1']))
mydata['zskewness_about'] = np.abs(stats.zscore(mydata['skewness_about']))
mydata['zskewness_about.1'] = np.abs(stats.zscore(mydata['skewness_about.1']))
In [9]:
mydata.head()
Out[9]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity ... hollows_ratio class zradius_ratio zpr.axis_aspect_ratio zmax.length_aspect_ratio zscaled_variance zscaled_variance.1 zscaled_radius_of_gyration.1 zskewness_about zskewness_about.1
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 ... 197 van 0.273363 1.310398 0.311542 0.401920 0.341934 0.327326 0.073812 0.380870
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 ... 199 van 0.835032 0.593753 0.094079 0.593357 0.619724 0.059384 0.538390 0.156798
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 ... 196 car 1.202018 0.548738 0.311542 1.097671 1.109379 0.074587 1.558727 0.403383
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 ... 207 van 0.295813 0.167907 0.094079 0.912419 0.738777 1.265121 0.073812 0.291347
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 ... 183 bus 1.082192 5.245643 9.444962 1.671982 0.648070 7.309005 0.538390 0.179311

5 rows × 27 columns

In [10]:
df_clean = mydata[mydata['zradius_ratio'] <= 3]
print("%i records have been removed after treating radius_ratio" %(mydata.shape[0]-df_clean.shape[0]))
record = df_clean.shape[0]
print("Total Records - %i" %(record))
3 records have been removed after treating radius_ratio
Total Records - 843
In [11]:
df_clean = df_clean[df_clean['zpr.axis_aspect_ratio'] <= 3]
print("%i records have been removed after treating zpr.axis_aspect_ratio" %(record-df_clean.shape[0]))
record = df_clean.shape[0]
print("Total Records - %i" %(record))
5 records have been removed after treating zpr.axis_aspect_ratio
Total Records - 838
In [12]:
df_clean = df_clean[df_clean['zmax.length_aspect_ratio'] <= 3]
print("%i records have been removed after treating zmax.length_aspect_ratio" %(record-df_clean.shape[0]))
record = df_clean.shape[0]
print("Total Records - %i" %(record))
1 records have been removed after treating zmax.length_aspect_ratio
Total Records - 837
In [13]:
df_clean = df_clean[df_clean['zscaled_variance'] <= 3]
print("%i records have been removed after treating zscaled_variance" %(record-df_clean.shape[0]))
record = df_clean.shape[0]
print("Total Records - %i" %(record))
5 records have been removed after treating zscaled_variance
Total Records - 832
In [14]:
df_clean = df_clean[df_clean['zscaled_variance.1'] <= 3]
print("%i records have been removed after treating zscaled_variance.1" %(record-df_clean.shape[0]))
record = df_clean.shape[0]
print("Total Records - %i" %(record))
1 records have been removed after treating zscaled_variance.1
Total Records - 831
In [15]:
df_clean = df_clean[df_clean['zscaled_radius_of_gyration.1'] <= 3]
print("%i records have been removed after treating zscaled_radius_of_gyration.1" %(record-df_clean.shape[0]))
record = df_clean.shape[0]
print("Total Records - %i" %(record))
0 records have been removed after treating zscaled_radius_of_gyration.1
Total Records - 831
In [16]:
df_clean = df_clean[df_clean['zskewness_about'] <= 3]
print("%i records have been removed after treating zskewness_about" %(record-df_clean.shape[0]))
record = df_clean.shape[0]
print("Total Records - %i" %(record))
4 records have been removed after treating zskewness_about
Total Records - 827
In [17]:
df_clean = df_clean[df_clean['zskewness_about.1'] <= 3]
print("%i records have been removed after treating zskewness_about.1" %(record-df_clean.shape[0]))
record = df_clean.shape[0]
print("Total Records - %i" %(record))
2 records have been removed after treating zskewness_about.1
Total Records - 825

Copy to another dataframe

In [18]:
mydata_copy = df_clean.copy()
a = df_clean.copy()

Check Outliers after Treating with zscore

There is a significant reduction in the outliers for all the columns

In [19]:
col = ['zradius_ratio',
       'zpr.axis_aspect_ratio', 'zmax.length_aspect_ratio', 'zscaled_variance',
       'zscaled_variance.1', 'zscaled_radius_of_gyration.1', 'zskewness_about',
       'zskewness_about.1', 'class']
In [20]:
df_clean.drop(col, axis = 1, inplace = True)
In [21]:
for feature in df_clean.columns:
    plt.figure(figsize = (6,6))
    df_clean.boxplot(feature)

Understanding the attributes - Find relationship between different attributes (Independent variables)

In [22]:
col = ['zradius_ratio',
       'zpr.axis_aspect_ratio', 'zmax.length_aspect_ratio', 'zscaled_variance',
       'zscaled_variance.1', 'zscaled_radius_of_gyration.1', 'zskewness_about',
       'zskewness_about.1']
a.drop(col, axis = 1, inplace = True)
In [23]:
a.describe(include = 'all').transpose()
#There are 3 unique values for the 'class' variable. Car has the highest count
Out[23]:
count unique top freq mean std min 25% 50% 75% max
compactness 825 NaN NaN NaN 93.5794 8.13397 73 87 93 100 119
circularity 825 NaN NaN NaN 44.7636 6.10218 33 40 44 49 59
distance_circularity 825 NaN NaN NaN 81.9612 15.6993 40 70 80 98 112
radius_ratio 825 NaN NaN NaN 167.692 31.8875 104 141 166 194 246
pr.axis_aspect_ratio 825 NaN NaN NaN 61.2048 5.62958 47 57 61 65 76
max.length_aspect_ratio 825 NaN NaN NaN 8.16727 2.22271 2 7 8 10 22
scatter_ratio 825 NaN NaN NaN 168.416 32.5066 112 146 157 198 257
elongatedness 825 NaN NaN NaN 41.0024 7.76177 26 33 43 46 61
pr.axis_rectangularity 825 NaN NaN NaN 20.5394 2.52295 17 19 20 23 28
max.length_rectangularity 825 NaN NaN NaN 147.819 14.4753 118 137 146 159 188
scaled_variance 825 NaN NaN NaN 187.418 29.8745 130 167 178 216 280
scaled_variance.1 825 NaN NaN NaN 436.444 171.521 184 318 363.5 586 957
scaled_radius_of_gyration 825 NaN NaN NaN 174.181 31.9953 109 149 173 198 268
scaled_radius_of_gyration.1 825 NaN NaN NaN 72.0461 6.28464 59 67 71 75 90
skewness_about 825 NaN NaN NaN 6.29333 4.81213 0 2 6 9 21
skewness_about.1 825 NaN NaN NaN 12.503 8.82133 0 5 11 19 39
skewness_about.2 825 NaN NaN NaN 188.962 6.12807 176 185 189 193 204
hollows_ratio 825 NaN NaN NaN 195.678 7.36416 181 191 197 201 211
class 825 3 car 423 NaN NaN NaN NaN NaN NaN NaN

Correlation Values

We find that most of the independent variables are having a very high positive correlation

  • Compactness has a very strong correlation with distance_circularity, circularity, radius_ratio, scatter_ratio, pr.axis_rectangularity
    • Circularity seems to be strongly correlated with scatter_ratio, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1 and scaled_radius_of_gyration
    • distance_circularity is strongly correlated with circularity, radius_ratio, scaled_variance, scaled_variance.1
    • scatter_ratio is having a high correlation with compactness, circularity, distance_circularity, radius_ratio, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1 and scaled_radius_of_gyration
    • elongatedness is highly negatively correlated to scaled_variance and scaled_variance.1
    • scaled_variance and scaled_variance.1 are almost correlated to 1
In [24]:
a.corr()
Out[24]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
compactness 1.000000 0.682092 0.787279 0.752054 0.217936 0.466257 0.807819 -0.784241 0.809128 0.673479 0.786169 0.810277 0.576358 -0.292685 0.228904 0.153585 0.319140 0.396822
circularity 0.682092 1.000000 0.791339 0.649839 0.220865 0.541831 0.849816 -0.820636 0.846705 0.963347 0.810949 0.839551 0.928987 0.027689 0.148461 -0.015216 -0.089129 0.071057
distance_circularity 0.787279 0.791339 1.000000 0.814886 0.261853 0.628634 0.907811 -0.910227 0.896957 0.771809 0.888030 0.891242 0.702193 -0.282863 0.110752 0.265734 0.161602 0.358179
radius_ratio 0.752054 0.649839 0.814886 1.000000 0.671630 0.446343 0.799829 -0.849616 0.774839 0.586216 0.806588 0.788535 0.560520 -0.428326 0.056774 0.184745 0.435156 0.524234
pr.axis_aspect_ratio 0.217936 0.220865 0.261853 0.671630 1.000000 0.166016 0.225071 -0.321604 0.195656 0.163483 0.246342 0.213431 0.176335 -0.318913 -0.049169 -0.025929 0.402521 0.413940
max.length_aspect_ratio 0.466257 0.541831 0.628634 0.446343 0.166016 1.000000 0.484253 -0.486877 0.482938 0.619196 0.415299 0.448513 0.402707 -0.302671 0.084929 0.137072 0.050980 0.360666
scatter_ratio 0.807819 0.849816 0.907811 0.799829 0.225071 0.484253 1.000000 -0.973118 0.989283 0.809600 0.977457 0.993031 0.792244 -0.047187 0.070262 0.212171 0.029151 0.157142
elongatedness -0.784241 -0.820636 -0.910227 -0.849616 -0.321604 -0.486877 -0.973118 1.000000 -0.950674 -0.773352 -0.967841 -0.957075 -0.761476 0.131259 -0.050381 -0.184390 -0.134281 -0.246276
pr.axis_rectangularity 0.809128 0.846705 0.896957 0.774839 0.195656 0.482938 0.989283 -0.950674 1.000000 0.812765 0.962767 0.987609 0.789106 -0.031687 0.079015 0.214309 0.004923 0.138900
max.length_rectangularity 0.673479 0.963347 0.771809 0.586216 0.163483 0.619196 0.809600 -0.773352 0.812765 1.000000 0.753500 0.796746 0.866823 0.010370 0.137284 0.000332 -0.088717 0.101050
scaled_variance 0.786169 0.810949 0.888030 0.806588 0.246342 0.415299 0.977457 -0.967841 0.962767 0.753500 1.000000 0.974479 0.777800 -0.034697 0.035825 0.203810 0.055896 0.139095
scaled_variance.1 0.810277 0.839551 0.891242 0.788535 0.213431 0.448513 0.993031 -0.957075 0.987609 0.796746 0.974479 1.000000 0.786498 -0.035254 0.072381 0.201066 0.033105 0.146236
scaled_radius_of_gyration 0.576358 0.928987 0.702193 0.560520 0.176335 0.402707 0.792244 -0.761476 0.789106 0.866823 0.777800 0.786498 1.000000 0.173597 0.173601 -0.060997 -0.204915 -0.084896
scaled_radius_of_gyration.1 -0.292685 0.027689 -0.282863 -0.428326 -0.318913 -0.302671 -0.047187 0.131259 -0.031687 0.010370 -0.034697 -0.035254 0.173597 1.000000 -0.090516 -0.138220 -0.843045 -0.915988
skewness_about 0.228904 0.148461 0.110752 0.056774 -0.049169 0.084929 0.070262 -0.050381 0.079015 0.137284 0.035825 0.072381 0.173601 -0.090516 1.000000 -0.047879 0.103547 0.086815
skewness_about.1 0.153585 -0.015216 0.265734 0.184745 -0.025929 0.137072 0.212171 -0.184390 0.214309 0.000332 0.203810 0.201066 -0.060997 -0.138220 -0.047879 1.000000 0.077915 0.204448
skewness_about.2 0.319140 -0.089129 0.161602 0.435156 0.402521 0.050980 0.029151 -0.134281 0.004923 -0.088717 0.055896 0.033105 -0.204915 -0.843045 0.103547 0.077915 1.000000 0.891637
hollows_ratio 0.396822 0.071057 0.358179 0.524234 0.413940 0.360666 0.157142 -0.246276 0.138900 0.101050 0.139095 0.146236 -0.084896 -0.915988 0.086815 0.204448 0.891637 1.000000

Pairplot

From the pairplot, we get almost the same inferences from the correlation matrix.

In [25]:
sns.pairplot(a)
Out[25]:
<seaborn.axisgrid.PairGrid at 0x129e946a0>

Choose carefully which all attributes have to be a part of the analysis and why

We will drop the variables which has a correlation value of > 0.8 because one variable explains the other and there is no need to have both the variables in the same dataset. Here, we require domain experience as well to factor if we are missing out any relevant information while dropping the variables. Based on the correlation numbers, we decide to drop out the following:

elongatedness
pr.axis_rectangularity
max.length_rectangularity
scaled_radius_of_gyration
skewness_about.2
scatter_ratio
scaled_variance
scaled_variance.1

Note: Of course, there will be a higher accuracy if we include all the variables but it includes the noise generated by them as well. We need to strike a balance to not lose the information and on the other hand having minumum variables for the data set. This is a trade off to cater to the curse of dimentionality problem. Too many variables and lesser rows of data will lead to such a problem.

In [26]:
rem = ['max.length_rectangularity','scaled_radius_of_gyration','skewness_about.2','scatter_ratio','elongatedness','pr.axis_rectangularity',
    'scaled_variance','scaled_variance.1']
df_clean.drop(rem,axis = 1, inplace = True)

Scaling of the data using Standard Scaler

Data needs to be scaled for any distance based algorithm such as clustering

In [27]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
df_clean = sc.fit_transform(df_clean)

Finding Optimum number of Clusters

Cluster = 3 has the Silhoutte Score of 0.2399 and a good dip in average distortion

In [28]:
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

clusters = range(2,10)
meandistortion = []

for k in clusters:
    model = KMeans(n_clusters = k)
    model = model.fit(df_clean)
    prediction = model.predict(df_clean)
    meandistortion.append(sum(np.min(cdist(df_clean,model.cluster_centers_, 'euclidean'), axis = 1))/df_clean.shape[0])
    print("For Cluster = %i, the Silhouette Score is %1.4f" %(k,silhouette_score(df_clean,model.labels_)))
    
plt.plot(clusters, meandistortion, 'bx-')
plt.xlabel('k - Number of Clusters')
plt.ylabel('Average Distortion')
plt.title('Selecting k with the Elbow method')
    
For Cluster = 2, the Silhouette Score is 0.2587
For Cluster = 3, the Silhouette Score is 0.2399
For Cluster = 4, the Silhouette Score is 0.1962
For Cluster = 5, the Silhouette Score is 0.1998
For Cluster = 6, the Silhouette Score is 0.1788
For Cluster = 7, the Silhouette Score is 0.2028
For Cluster = 8, the Silhouette Score is 0.1895
For Cluster = 9, the Silhouette Score is 0.1844
Out[28]:
Text(0.5, 1.0, 'Selecting k with the Elbow method')

K Means Clustering Algorithm

In [29]:
clus = KMeans(n_clusters = 3, random_state = 1)
clus.fit_predict(df_clean)
Out[29]:
array([2, 2, 1, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 1, 0, 0, 2,
       2, 1, 2, 0, 1, 2, 0, 2, 2, 2, 1, 2, 2, 0, 1, 2, 1, 0, 0, 2, 1, 0,
       0, 0, 2, 2, 0, 2, 1, 2, 1, 2, 2, 0, 1, 0, 1, 0, 0, 0, 2, 0, 0, 1,
       2, 1, 1, 1, 2, 0, 2, 1, 2, 0, 1, 0, 2, 1, 2, 0, 2, 2, 0, 2, 0, 1,
       2, 1, 2, 0, 1, 0, 0, 1, 0, 2, 2, 0, 2, 1, 1, 0, 0, 2, 2, 2, 0, 0,
       0, 1, 1, 1, 0, 2, 0, 0, 2, 2, 2, 0, 2, 2, 1, 1, 2, 0, 1, 0, 2, 0,
       2, 2, 0, 1, 0, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 0, 2, 2, 0,
       1, 2, 2, 1, 1, 2, 1, 0, 0, 1, 0, 2, 1, 2, 2, 2, 2, 2, 0, 1, 0, 2,
       0, 1, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 0, 1, 0, 0, 0, 2, 2, 1, 1, 2,
       2, 2, 0, 0, 1, 2, 2, 2, 0, 0, 2, 0, 1, 0, 2, 1, 0, 2, 0, 0, 2, 1,
       2, 1, 0, 0, 0, 0, 1, 2, 0, 2, 0, 1, 0, 2, 2, 0, 1, 0, 0, 2, 2, 1,
       0, 0, 1, 0, 2, 2, 1, 2, 2, 1, 1, 0, 2, 2, 2, 1, 0, 0, 2, 2, 0, 0,
       2, 2, 2, 1, 2, 0, 0, 1, 2, 2, 0, 0, 1, 0, 2, 2, 2, 1, 0, 0, 2, 2,
       1, 2, 1, 0, 2, 2, 1, 2, 2, 2, 0, 2, 1, 1, 1, 1, 1, 2, 2, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 1, 2, 0, 1, 0, 2, 2, 2, 1, 1, 0, 1, 1, 0, 1, 2,
       2, 2, 0, 0, 1, 1, 1, 1, 2, 2, 2, 1, 0, 2, 0, 1, 2, 2, 1, 2, 1, 1,
       1, 2, 2, 0, 1, 2, 0, 0, 2, 2, 2, 2, 2, 0, 1, 1, 0, 0, 1, 0, 1, 0,
       1, 2, 2, 2, 2, 1, 0, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 0, 0,
       2, 2, 2, 0, 0, 2, 0, 1, 2, 2, 2, 2, 0, 1, 2, 0, 2, 2, 1, 2, 1, 2,
       1, 1, 0, 0, 1, 2, 0, 0, 2, 1, 1, 0, 2, 1, 1, 0, 1, 1, 1, 2, 2, 2,
       2, 2, 1, 0, 0, 2, 1, 2, 2, 1, 2, 0, 2, 0, 0, 1, 1, 2, 2, 1, 0, 1,
       0, 1, 1, 2, 2, 0, 1, 1, 2, 2, 0, 0, 1, 2, 0, 1, 1, 2, 0, 1, 0, 2,
       1, 2, 0, 1, 1, 1, 0, 0, 1, 1, 1, 2, 2, 1, 0, 2, 1, 0, 0, 1, 0, 2,
       2, 0, 2, 1, 2, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 1, 2, 1, 1, 2, 2, 2,
       2, 2, 0, 0, 2, 2, 1, 0, 0, 2, 0, 1, 2, 1, 0, 0, 1, 1, 2, 1, 2, 2,
       1, 1, 2, 0, 2, 1, 2, 2, 0, 1, 1, 1, 1, 2, 0, 0, 0, 1, 1, 1, 2, 1,
       0, 2, 1, 0, 0, 0, 2, 0, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2,
       0, 1, 0, 0, 2, 0, 2, 2, 0, 0, 1, 1, 2, 2, 0, 1, 2, 2, 1, 2, 0, 1,
       0, 1, 0, 0, 2, 0, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 1, 2, 1, 0, 2, 2,
       1, 0, 2, 2, 2, 1, 2, 1, 0, 2, 2, 2, 1, 1, 1, 0, 1, 2, 2, 2, 2, 1,
       0, 1, 0, 2, 2, 2, 0, 1, 2, 0, 2, 0, 1, 2, 2, 1, 0, 2, 0, 2, 2, 0,
       2, 1, 1, 2, 2, 1, 1, 2, 0, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1,
       2, 1, 2, 0, 1, 2, 0, 1, 1, 1, 2, 0, 0, 1, 1, 1, 2, 1, 2, 2, 1, 2,
       0, 2, 0, 2, 1, 2, 0, 2, 2, 2, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 2, 2,
       1, 2, 0, 1, 1, 0, 2, 2, 1, 0, 1, 0, 1, 2, 2, 1, 0, 0, 1, 0, 1, 2,
       0, 2, 1, 2, 2, 0, 2, 1, 1, 2, 2, 0, 2, 2, 1, 0, 2, 1, 0, 0, 1, 0,
       2, 0, 0, 0, 2, 1, 1, 2, 0, 1, 2, 1, 1, 0, 2, 1, 0, 0, 2, 2, 1, 0,
       0, 0, 2, 2, 2, 2, 2, 2, 1, 2, 0], dtype=int32)
In [30]:
col = ['zradius_ratio',
       'zpr.axis_aspect_ratio', 'zmax.length_aspect_ratio', 'zscaled_variance',
       'zscaled_variance.1', 'zscaled_radius_of_gyration.1', 'zskewness_about',
       'zskewness_about.1']
mydata_copy.drop(col,axis = 1, inplace = True)
In [31]:
plt.scatter(df_clean[:,0], df_clean[:,1], c=clus.labels_)
plt.show()

Hierarchial Clustering Method

In [32]:
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(df_clean, 'ward', metric = 'euclidean')
Z.shape
Out[32]:
(824, 4)
In [33]:
plt.figure(figsize=(25, 10))
dendrogram(Z)
plt.show()
In [34]:
dendrogram(Z,truncate_mode='lastp',p=3)
plt.show()
In [35]:
from scipy.cluster.hierarchy import fcluster
max_d=40
clusters = fcluster(Z, max_d, criterion='distance')
In [36]:
plt.scatter(df_clean[:,0], df_clean[:,1], c=clusters)  # plot points with cluster dependent colors
plt.show()

Comparing the results of KMeans Cluster and Hierarchial Cluster

In [37]:
mydata_copy.head()
Out[37]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
5 107 44.0 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus
In [38]:
mydata_copy['HCluster'] = clusters
mydata_copy['KCluster'] = clus.labels_
df_final = mydata_copy.copy()
df_final.drop('class', inplace = True, axis = 1)
In [39]:
df_final.groupby('HCluster').median()
Out[39]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio KCluster
HCluster
1 103 52.0 101.5 202.0 63.0 10 212.0 31.0 24.0 163 223.0 665.0 211.5 71.0 6.0 13.0 189.0 198 1
2 92 41.0 77.0 169.0 63.0 7 153.0 43.0 19.0 140 174.0 350.5 152.5 67.0 6.0 11.0 195.0 202 2
3 86 42.0 70.0 136.0 57.0 7 149.0 45.0 19.0 143 169.0 324.0 162.0 76.0 5.0 9.0 183.0 189 0
In [40]:
df_final.groupby('KCluster').median()
Out[40]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio HCluster
KCluster
0 86 42.0 68.0 130.0 56.0 7 148.0 46.0 19.0 142 168.0 320.0 164.0 78.0 5.0 9.0 182.0 186 3
1 103 53.0 103.0 203.0 63.0 10 212.0 31.0 24.0 164 223.0 667.0 212.0 71.0 6.0 14.0 189.0 198 1
2 91 42.0 77.0 166.0 63.0 8 154.0 43.0 19.0 141 175.0 353.0 157.0 68.0 5.0 10.0 193.0 200 2
In [41]:
df_final['HC'] = np.where(df_final['HCluster'] == 1, 1, np.where(df_final['HCluster'] == 2, 2, 0))
In [42]:
df_final['Match'] = np.where(df_final['KCluster'] == df_final['HC'], "True", "False")
In [43]:
df_final.groupby('Match')['Match'].count()
Out[43]:
Match
False    123
True     702
Name: Match, dtype: int64
In [44]:
print("Accuracy percentage of match between KMeans and HCluster is %2.2f" %(100*df_final.groupby('Match')['Match'].count()[1]/df_final.shape[0]))
Accuracy percentage of match between KMeans and HCluster is 85.09

Classification Technique - SVM

In [45]:
c = mydata_copy.copy()
c.drop(['HCluster', 'KCluster'], axis = 1, inplace = True)
c.drop(rem, axis = 1, inplace = True)
In [46]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
c['class'] = le.fit_transform(c['class'])
In [47]:
y = c['class']
X = c.drop(['class'], axis = 1)

Test Train Data Split

In [48]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1, test_size = 0.3)

Data Scale

In [49]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

Applying the SVM Model

In [50]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn import metrics
svm = SVC(probability = True, random_state = 0)
svm.fit(X_train, y_train)
svmpredict = svm.predict(X_test)
print(classification_report(y_test, svmpredict))
print("Accuracy Score is %5.3f " %(accuracy_score(y_test, svmpredict) * 100))
              precision    recall  f1-score   support

           0       0.97      1.00      0.98        58
           1       0.91      0.98      0.94       133
           2       0.96      0.75      0.84        57

    accuracy                           0.93       248
   macro avg       0.94      0.91      0.92       248
weighted avg       0.93      0.93      0.93       248

Accuracy Score is 93.145 
In [51]:
cm_svm = metrics.confusion_matrix(y_test, svmpredict, labels = [2,1,0])
df_cm_svm = pd.DataFrame(cm_svm, index = [i for i in ["2","1", "0"]], columns = [i for i in ["Predict 2", "Predict 1", "Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm_svm, annot = True, cmap = "Greens", fmt='g')
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x1392a7908>

K Fold Cross Validation Score for SVM

In [52]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = svm, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Accuracy: 93.06 %
Standard Deviation: 1.89 %

Principal Component Analysis - Feature Extraction

In [53]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 6, random_state = 0)
PCAX_train = pca.fit_transform(X_train)
PCAX_test = pca.transform(X_test)
variance = pca.explained_variance_ratio_

Variance Captured by PCA Components

In [54]:
print(variance)
[0.42145417 0.18336633 0.11008503 0.10786816 0.07744109 0.05309206]

Applying SVM Classification with PCA Components

In [55]:
svmpca = SVC(probability = True, random_state = 0)
svmpca.fit(PCAX_train, y_train)
svmpredictpca = svmpca.predict(PCAX_test)
print(classification_report(y_test, svmpredictpca))
print("Accuracy Score is %5.3f " %(accuracy_score(y_test, svmpredictpca) * 100))
              precision    recall  f1-score   support

           0       0.92      1.00      0.96        58
           1       0.86      0.92      0.89       133
           2       0.81      0.61      0.70        57

    accuracy                           0.87       248
   macro avg       0.86      0.84      0.85       248
weighted avg       0.86      0.87      0.86       248

Accuracy Score is 86.694 
In [56]:
cm_svm = metrics.confusion_matrix(y_test, svmpredictpca, labels = [2,1,0])
df_cm_svm = pd.DataFrame(cm_svm, index = [i for i in ["2","1", "0"]], columns = [i for i in ["Predict 2", "Predict 1", "Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm_svm, annot = True, cmap = "Greens", fmt='g')
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x1394dcf98>

K Fold Cross Validation Score

In [57]:
accuraciespca = cross_val_score(estimator = svmpca, X = PCAX_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuraciespca.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuraciespca.std()*100))
Accuracy: 87.51 %
Standard Deviation: 3.48 %

PCA Components wise model accuracies

In [58]:
for components in range (2,10):
    pca = PCA(n_components = components, random_state = 0)
    PCAX_train = pca.fit_transform(X_train)
    PCAX_test = pca.transform(X_test)
    svmpca.fit(PCAX_train, y_train)
    svmpredictpca1 = svmpca.predict(PCAX_test)
    accuraciespca = cross_val_score(estimator = svmpca, X = PCAX_train, y = y_train, cv = 10)
    print("For %i Components:" %(components))
    print("Accuracy: {:.2f} %".format(accuraciespca.mean()*100))
    print("Standard Deviation: {:.2f} %".format(accuraciespca.std()*100))
    print()
    
For 2 Components:
Accuracy: 63.27 %
Standard Deviation: 5.48 %

For 3 Components:
Accuracy: 75.73 %
Standard Deviation: 5.72 %

For 4 Components:
Accuracy: 78.32 %
Standard Deviation: 5.59 %

For 5 Components:
Accuracy: 83.20 %
Standard Deviation: 5.02 %

For 6 Components:
Accuracy: 87.51 %
Standard Deviation: 3.48 %

For 7 Components:
Accuracy: 88.88 %
Standard Deviation: 3.00 %

For 8 Components:
Accuracy: 90.79 %
Standard Deviation: 2.98 %

For 9 Components:
Accuracy: 92.36 %
Standard Deviation: 1.97 %

In [59]:
mean=[]
std=[]
for components in range (2,10):
    pca = PCA(n_components = components, random_state = 0)
    PCAX_train = pca.fit_transform(X_train)
    PCAX_test = pca.transform(X_test)
    svmpca.fit(PCAX_train, y_train)
    svmpredictpca1 = svmpca.predict(PCAX_test)
    accuraciespca = cross_val_score(estimator = svmpca, X = PCAX_train, y = y_train, cv = 10)
    mean.append(accuraciespca.mean()*100)
   
    

Curve of PCA Components and Model Accuracy

Ideal PCA Components are 6

In [60]:
plt.plot(range(2,10), mean, label = "Mean")
plt.xlabel("PCA Components")
plt.ylabel("Model Accuracy Percentage")
plt.title("PCA Components Vs Model Accuracy %")
plt.legend(loc = 4)
plt.show()

Another method for PCA

In [61]:
sc = StandardScaler()
X_standard = sc.fit_transform(X)

Construct a covariance matrix

In [62]:
covariance_matrix = np.cov(X_standard.T)
print(covariance_matrix)
[[ 1.00121359  0.68291951  0.78823425  0.75296713  0.2182003   0.46682272
  -0.29304047  0.22918182  0.15377127  0.3973032 ]
 [ 0.68291951  1.00121359  0.79229919  0.65062721  0.22113312  0.54248891
   0.02772258  0.14864088 -0.01523443  0.07114321]
 [ 0.78823425  0.79229919  1.00121359  0.81587461  0.2621709   0.62939673
  -0.28320585  0.11088651  0.26605663  0.35861409]
 [ 0.75296713  0.65062721  0.81587461  1.00121359  0.67244556  0.44688438
  -0.42884537  0.0568428   0.18496902  0.52487037]
 [ 0.2182003   0.22113312  0.2621709   0.67244556  1.00121359  0.16621732
  -0.31930019 -0.04922885 -0.0259603   0.41444265]
 [ 0.46682272  0.54248891  0.62939673  0.44688438  0.16621732  1.00121359
  -0.303038    0.08503218  0.13723813  0.36110332]
 [-0.29304047  0.02772258 -0.28320585 -0.42884537 -0.31930019 -0.303038
   1.00121359 -0.0906261  -0.13838782 -0.91709931]
 [ 0.22918182  0.14864088  0.11088651  0.0568428  -0.04922885  0.08503218
  -0.0906261   1.00121359 -0.04793736  0.08692039]
 [ 0.15377127 -0.01523443  0.26605663  0.18496902 -0.0259603   0.13723813
  -0.13838782 -0.04793736  1.00121359  0.20469662]
 [ 0.3973032   0.07114321  0.35861409  0.52487037  0.41444265  0.36110332
  -0.91709931  0.08692039  0.20469662  1.00121359]]

Calculate the eigenvalues and eigenvectors

In [63]:
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
print(eigenvalues)
print(eigenvectors)
[4.37845032 1.73052184 1.10432199 1.06680761 0.75707542 0.52185321
 0.20948233 0.13786129 0.0410773  0.06468462]
[[ 0.39698523 -0.19715283  0.10808393  0.0660961  -0.08057256  0.47183884
  -0.68254848  0.19072259  0.20333485 -0.12151639]
 [ 0.33990805 -0.4618814  -0.1051219   0.03906411  0.07397558  0.02661113
   0.52202757  0.59787208  0.00136179 -0.14968391]
 [ 0.42495483 -0.23842374  0.07552229 -0.13234004  0.02747136  0.1083981
   0.29415085 -0.58890502  0.43452429  0.32357376]
 [ 0.43857141 -0.00786211 -0.25133836 -0.05607487 -0.24328717  0.08979955
   0.01077663 -0.30252708 -0.75657965 -0.09986891]
 [ 0.24576761  0.23187682 -0.61947865  0.04140146 -0.38180941 -0.4314506
  -0.12705501  0.09776033  0.37689752 -0.02614755]
 [ 0.32296438 -0.12050785  0.17077018 -0.05551779  0.57076838 -0.6454811
  -0.30085392 -0.03650903 -0.10634658 -0.05787242]
 [-0.2671588  -0.56677984 -0.14463316 -0.13758237 -0.2140358  -0.16235234
  -0.24777482  0.10520039 -0.16963999  0.62779958]
 [ 0.0813994  -0.09644531  0.45244274  0.72300735 -0.41918311 -0.27069871
   0.04936803 -0.06276847 -0.02978255  0.01928792]
 [ 0.11406186  0.11053866  0.50753969 -0.64759501 -0.46634667 -0.20520473
   0.03398151  0.16465175  0.01756652 -0.09815059]
 [ 0.31339472  0.52899882  0.11070991  0.08719455  0.13240984  0.11092936
   0.0577371   0.33887301 -0.11871947  0.66331284]]

Form eigenpairs - Each pair will be an eigenvalue and the column wise eigenvector values

In [64]:
eigenpair = [(eigenvalues[i], eigenvectors[:,i]) for i in range(len(eigenvalues))]
In [65]:
print(eigenpair)
[(4.3784503156759085, array([ 0.39698523,  0.33990805,  0.42495483,  0.43857141,  0.24576761,
        0.32296438, -0.2671588 ,  0.0813994 ,  0.11406186,  0.31339472])), (1.7305218422769315, array([-0.19715283, -0.4618814 , -0.23842374, -0.00786211,  0.23187682,
       -0.12050785, -0.56677984, -0.09644531,  0.11053866,  0.52899882])), (1.1043219904977333, array([ 0.10808393, -0.1051219 ,  0.07552229, -0.25133836, -0.61947865,
        0.17077018, -0.14463316,  0.45244274,  0.50753969,  0.11070991])), (1.066807609794081, array([ 0.0660961 ,  0.03906411, -0.13234004, -0.05607487,  0.04140146,
       -0.05551779, -0.13758237,  0.72300735, -0.64759501,  0.08719455])), (0.7570754174362223, array([-0.08057256,  0.07397558,  0.02747136, -0.24328717, -0.38180941,
        0.57076838, -0.2140358 , -0.41918311, -0.46634667,  0.13240984])), (0.5218532081071776, array([ 0.47183884,  0.02661113,  0.1083981 ,  0.08979955, -0.4314506 ,
       -0.6454811 , -0.16235234, -0.27069871, -0.20520473,  0.11092936])), (0.20948233301054342, array([-0.68254848,  0.52202757,  0.29415085,  0.01077663, -0.12705501,
       -0.30085392, -0.24777482,  0.04936803,  0.03398151,  0.0577371 ])), (0.13786128511162127, array([ 0.19072259,  0.59787208, -0.58890502, -0.30252708,  0.09776033,
       -0.03650903,  0.10520039, -0.06276847,  0.16465175,  0.33887301])), (0.0410773030141572, array([ 0.20333485,  0.00136179,  0.43452429, -0.75657965,  0.37689752,
       -0.10634658, -0.16963999, -0.02978255,  0.01756652, -0.11871947])), (0.0646846174057235, array([-0.12151639, -0.14968391,  0.32357376, -0.09986891, -0.02614755,
       -0.05787242,  0.62779958,  0.01928792, -0.09815059,  0.66331284]))]

Sort the eigenpair in descending order

In [66]:
eigenpair.sort()
eigenpair.reverse()
In [67]:
print(eigenpair)
[(4.3784503156759085, array([ 0.39698523,  0.33990805,  0.42495483,  0.43857141,  0.24576761,
        0.32296438, -0.2671588 ,  0.0813994 ,  0.11406186,  0.31339472])), (1.7305218422769315, array([-0.19715283, -0.4618814 , -0.23842374, -0.00786211,  0.23187682,
       -0.12050785, -0.56677984, -0.09644531,  0.11053866,  0.52899882])), (1.1043219904977333, array([ 0.10808393, -0.1051219 ,  0.07552229, -0.25133836, -0.61947865,
        0.17077018, -0.14463316,  0.45244274,  0.50753969,  0.11070991])), (1.066807609794081, array([ 0.0660961 ,  0.03906411, -0.13234004, -0.05607487,  0.04140146,
       -0.05551779, -0.13758237,  0.72300735, -0.64759501,  0.08719455])), (0.7570754174362223, array([-0.08057256,  0.07397558,  0.02747136, -0.24328717, -0.38180941,
        0.57076838, -0.2140358 , -0.41918311, -0.46634667,  0.13240984])), (0.5218532081071776, array([ 0.47183884,  0.02661113,  0.1083981 ,  0.08979955, -0.4314506 ,
       -0.6454811 , -0.16235234, -0.27069871, -0.20520473,  0.11092936])), (0.20948233301054342, array([-0.68254848,  0.52202757,  0.29415085,  0.01077663, -0.12705501,
       -0.30085392, -0.24777482,  0.04936803,  0.03398151,  0.0577371 ])), (0.13786128511162127, array([ 0.19072259,  0.59787208, -0.58890502, -0.30252708,  0.09776033,
       -0.03650903,  0.10520039, -0.06276847,  0.16465175,  0.33887301])), (0.0646846174057235, array([-0.12151639, -0.14968391,  0.32357376, -0.09986891, -0.02614755,
       -0.05787242,  0.62779958,  0.01928792, -0.09815059,  0.66331284])), (0.0410773030141572, array([ 0.20333485,  0.00136179,  0.43452429, -0.75657965,  0.37689752,
       -0.10634658, -0.16963999, -0.02978255,  0.01756652, -0.11871947]))]

Separate the sorted eigenvalues and eigenvectors

In [68]:
eigenvalues_sorted = [eigenpair[i][0] for i in range(len(eigenpair))]
eigenvectors_sorted = [eigenpair[i][1] for i in range(len(eigenpair))]

Calculate the explained variance ratio to the sum of total variance

In [69]:
total_variance = sum(eigenvalues_sorted)
for i in range(len(eigenvalues_sorted)):
    variance_explained = eigenvalues_sorted[i]/total_variance
    print(variance_explained)
0.4373143103172058
0.17284242400438682
0.11029834183880388
0.1065514509660997
0.07561577502635722
0.05212206587640173
0.020922841503113666
0.013769418052360715
0.006460621180886806
0.0041027512343837

Add the cumulative sum of variance explained

In [70]:
variance_explained = [(eigenvalues_sorted[i]/total_variance)for i in range(len(eigenvalues_sorted))]
cumulative_variance = np.cumsum(variance_explained)

Draw a bar and step chart for visualization

In [71]:
plt.bar(range(len(eigenvalues_sorted)), variance_explained, label = "Individual Variance Explained")
plt.step(range(len(eigenvalues_sorted)), cumulative_variance, label = "Cumulative Variance Explained")
plt.legend(loc = 'best')
Out[71]:
<matplotlib.legend.Legend at 0x13b7bb630>

Convert the scaled original dataset and do a dot product to the eigenvectors. This will provide a transformed dataset

In [72]:
PCAReduced = np.array(eigenvectors_sorted[0:6])
X_PCAReduced = np.dot(X_standard,PCAReduced.T)
df_PCAReduced = pd.DataFrame(X_PCAReduced)

Do a pairplot with the PCA Transformed Dataset

In [73]:
sns.pairplot(df_PCAReduced)
Out[73]:
<seaborn.axisgrid.PairGrid at 0x13b801390>

Check if there is any correlation for the PCA Transformed Data set

There is no correlation and each independent variable is 'independent'

In [74]:
df_PCAReduced.corr().round(5)
Out[74]:
0 1 2 3 4 5
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0 -0.0 0.0
3 0.0 0.0 0.0 1.0 -0.0 -0.0
4 0.0 0.0 -0.0 -0.0 1.0 -0.0
5 0.0 0.0 0.0 -0.0 -0.0 1.0

Compare the accuracy scores and cross validation scores of Support vector machines – one trained using raw data and the other using Principal Components, and mention your findings

SVM Without PCA and using Raw Data

In [75]:
print(classification_report(y_test, svmpredict))
print("Accuracy Score is %5.3f " %(accuracy_score(y_test, svmpredict) * 100))
              precision    recall  f1-score   support

           0       0.97      1.00      0.98        58
           1       0.91      0.98      0.94       133
           2       0.96      0.75      0.84        57

    accuracy                           0.93       248
   macro avg       0.94      0.91      0.92       248
weighted avg       0.93      0.93      0.93       248

Accuracy Score is 93.145 

SVM After PCA

In [76]:
print(classification_report(y_test, svmpredictpca))
print("Accuracy Score is %5.3f " %(accuracy_score(y_test, svmpredictpca) * 100))
              precision    recall  f1-score   support

           0       0.92      1.00      0.96        58
           1       0.86      0.92      0.89       133
           2       0.81      0.61      0.70        57

    accuracy                           0.87       248
   macro avg       0.86      0.84      0.85       248
weighted avg       0.86      0.87      0.86       248

Accuracy Score is 86.694 

Findings

SVM Before PCA

  • 10 Independent Variables were taken for the training and test set. We achieved an accuracy of 93.14%
  • Average Precision and Recall scores are 94% and 91% which is a good score. The F1 score is 93%.
  • We find that Recall value is 75% for classifying correctly as 2 when the original data set value for the same row is 2. Out of 4 times, only once we are unable to classify them correctly. This is a measure of True Negatives.

Now, when we do a PCA we are trying to eliminate the noise from the data and forming a new linear combination of independent variables into fewer variables.
The eigenvalue is the sum of maximum distance from the origin to the point where the data is projected onto the eigen vector.
The eigen vector when dot matrix multiplied to the original data set, we arrive at the PCA reduced dataset.
When we are reducing the number of variables, we can expect a reduction in the accuracy scores but at a benefit of handling lesser independent variables.

SVM After PCA

  • 6 PCA Components were taken for the training and test set. We achieved an accuracy of 86.69% after reducing to 6 components from 10 original independent variables
  • Average Precision and Recall scores are 86% and 84% which is a good score. The F1 score is 87%. We see that there is no big reduction in the F1 score.
  • We find that Recall value is 61% for classifying correctly as 2 when the original data set value for the same row is 2. Out of 10 times, four times we are unable to classify them correctly. This is a measure of True Negatives.

PCA does a good job and provides us a good accuracy score, when using production data we have to subject to the PCA reduction and then apply the classification algorithms.

In [ ]: